Sliding window list decoder for error correcting codes

ABSTRACT

A system for hardware error-correcting code (ECC) detection or correction of a received codeword from an original codeword includes an error-detecting circuit configured to process a selection of symbols of the received codeword using a set of factors, the original codeword being recomputable from a corresponding said selection of symbols of the original codeword using the set of factors. The error-detecting circuit includes a hardware multiplier and accumulator configured to use the set of factors and the selection of symbols of the received codeword to recompute remaining symbols of the original codeword, and a hardware comparator configured to compare the recomputed remaining symbols of the original codeword with corresponding said remaining symbols of the received codeword and to output first results of this comparison.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 15/650,806, filed on Jul. 14, 2017, which is a continuation of U.S. patent application Ser. No. 14/492,685, filed on Sep. 22, 2014, now U.S. Pat. No. 9,722,632, issued on Aug. 1, 2017, the entire contents of both of which are incorporated herein by reference.

This application is related to U.S. patent application Ser. No. 13/727,581 (hereinafter “the Reference Application”), entitled “USING PARITY DATA FOR CONCURRENT DATA AUTHENTICATION, CORRECTION, COMPRESSION, AND ENCRYPTION,” filed on Dec. 26, 2012, now U.S. Pat. No. 8,914,706, issued on Dec. 16, 2014, the entire content of which is also incorporated herein by reference.

BACKGROUND

1. Field

Aspects of embodiments of the present invention are directed toward a sliding window list decoder for error correcting codes.

2. Description of Related Art

Communication and storage technologies are rapidly evolving. 400 Gbit (gigabit) communication systems are in design, while 100 Tbit (terabit) systems are predicted by 2020. Given this growth, error correction power needs to scale up with transmission speed. As transmission rates increase, burst error lengths increase proportionally. For example, the same single bit error event at 10 Mbit (megabit) can affect 1000 bits at 10 Gbit. Traditional serial error correcting code (ECC) decoders (circa 1960) scale badly, and are guaranteed to silently corrupt data for some large errors. For the most part, they are all based on the same mathematics: derive key equation, solve the equation, then recover the correct data. Their latency increases with the correction power (e.g., number of symbols or entries, where each symbol or entry is some quantity of data, such as a byte), and their likelihood to silently corrupt data increases with error size.

Reed-Solomon error correction and other applications of ECC (such as data verification, encryption, and compression) are described in the Reference Application. Example solutions of Reed-Solomon error correction (such as the Welch-Berlekamp algorithm, or just Berlekamp) are described in Mann, The Original View of Reed-Solomon Coding and the Welch-Berlekamp Decoding Algorithm, Ph.D. dissertation, Univ. of Arizona, 2013, the entire content of which is incorporated by reference. Maximum likelihood decoding of Reed-Solomon codes is an NP-hard problem. In particular, there is no known algorithm that chooses the most likely decoding for K data symbols and T check (or parity) symbols (i.e., for codewords of N=K+T symbols) having up to T−1 errors in polynomial time with respect to N.

Put another way, the processing complexity of any such known algorithm grows faster than polynomially with N. The majority of existing Reed-Solomon error correction techniques, such as Welch-Berlekamp, are serial in nature, and require multiple clock cycles (the clock cycle count being proportional to N) to perform their correction. For more than T/2 errors, decoders such as Welch-Berlekamp have a design flaw such that cases with more than T/2 errors but fewer than T errors can be mistaken for cases with fewer than T/2, resulting in silent data corruption.

SUMMARY

Aspects of embodiments of the present invention are directed toward a sliding window list decoder for error correcting codes. In particular, aspects are directed toward the use of multiple Parallel Error Correctors (PECs) for significantly speeding up the processing of Reed-Solomon encoded codewords, and eliminating many cases of silent data corruption. Applications include erasure code data verification, correction, encryption, and compression, alone or in combination with each other, as described in the Reference Application.

In an embodiment of the present invention, multiple Reed-Solomon matrix multipliers are implemented in hardware and run concurrently. Each multiplier recomputes some portion of a received codeword. It should be noted that, as used throughout, a “received codeword” (or “supplied codeword”) may include the original codeword modified by some error word, which may or may not be 0. As such, a received codeword does not have to be a valid, consistent codeword. Based on the consistency of the regenerated symbols to the received symbols, different actions may be taken. For example, depending on the consistency of the regenerated symbols between multipliers, the multiplier(s) that produces the fewest number of errors may be selected for error correction. In another embodiment, the position of all errors (such as burst errors) smaller than length T may be identified.

In another embodiment, if any one multiplier corrects the fewest number of symbols, and it corrects no more symbols than half the number of parity symbols, then that multiplier is selected for error correction. In yet another embodiment, based on the consistency of the regenerated symbols to other regenerated symbols, the position of all possible errors smaller than length T can be identified.

According to some embodiments of the present invention, parallel ECC or sliding window list decoders are used in place of serial techniques. Parallel ECC and sliding window list decoders scale better than other approaches. For example, they have fixed latency and regular structure, the separate ECC decoders can do all of their processing independently of each other (thus making them readily parallelizable), their correction power scales with space (gate count) rather than time (frequency or gate speed), and they can easily be partitioned and realized as with a multi-core CPU. Parallel ECC and sliding window list decoders achieve faster correction through using more gates, not faster gates, extending the use of Reed-Solomon decoders to new applications, such as low latency DRAM or Flash memory systems.

According to some other embodiments of the present invention, sliding window list decoders are used in concert with serial techniques. Many existing techniques such as the Welch-Berlekamp algorithm only correct up to T/2 errors and can sometimes miscorrect more than T/2 errors even when they are caused by burst errors. The sliding window list decoder can provide information that can help other techniques both correct more errors and avoid miscorrections. The sliding window list decoder is able to do this by using more gates and more processing power, as well as exploiting situations where burst errors are more likely than sparse errors, which may happen in many real world situations.

According to an embodiment of the present invention, a system for hardware error-correcting code (ECC) detection or correction of a received codeword from an original codeword is provided. The system includes an error-detecting circuit configured to process a selection of symbols of the received codeword using a set of factors, the original codeword being recomputable from a corresponding said selection of symbols of the original codeword using the set of factors. The error-detecting circuit includes a hardware multiplier and accumulator configured to use the set of factors and the selection of symbols of the received codeword to recompute remaining symbols of the original codeword, and a hardware comparator configured to compare the recomputed remaining symbols of the original codeword with corresponding said remaining symbols of the received codeword and to output first results of this comparison.

The error-detecting circuit may be further configured to output the selection of symbols of the received codeword and the recomputed remaining symbols of the original codeword as an error-corrected codeword based on the first results.

The hardware comparator may be further configured to compare the recomputed remaining symbols of the original codeword with the remaining symbols of the received codeword by counting a first number of the recomputed remaining symbols of the original codeword that equal respective ones of the remaining symbols of the received codeword, and the first results may include the first number.

The error-detecting circuit may be further configured to output the selection of symbols of the received codeword and the recomputed remaining symbols of the original codeword as an error-corrected codeword based on the first number.

The error-detecting circuit may be further configured to output the error-corrected codeword when the first number is at least as large as a first threshold.

The first threshold may be one-half of a number of the remaining symbols of the received codeword.

The error-detecting circuit may be further configured to output an error indicator when the first number is smaller than the first threshold.

The error-detecting circuit may include a plurality of error-detecting circuits each configured to process a different said selection of symbols of the received codeword using a different said set of factors.

Each of the error-detecting circuits may be further configured to output the selection of symbols of the received codeword and the recomputed remaining symbols of the original codeword as an error-corrected codeword based on the first results.

The system may further include a multiplexor configured to output the error-corrected codeword from one of the error-detecting circuits based on the first results from all of the error-detecting circuits.

The hardware comparator may be further configured to compare the recomputed remaining symbols of the original codeword with the remaining symbols of the received codeword by counting a first number of the recomputed remaining symbols of the original codeword that equal respective ones of the remaining symbols of the received codeword, and the first results may include the first number.

Each of the error-detecting circuits may be further configured to output the selection of symbols of the received codeword and the recomputed remaining symbols of the original codeword as an error-corrected codeword based on the first number.

The system may further include a multiplexor configured to output the error-corrected codeword from one of the error-detecting circuits when the first number from the one of the error-detecting circuits is a largest said first number from all of the error-detecting circuits.

The multiplexor may be further configured to output the error-corrected codeword from the one of the error-detecting circuits when the largest first number is at least as large as a first threshold.

The first threshold may be one-half of a number of the remaining symbols of the received codeword.

The system may be configured to output an error indicator when the largest first number is smaller than the first threshold.

When more than one of the error-detecting circuits has the largest first number, the system may be configured to output an error indicator when the error-corrected codeword from each of the more than one of the error-detecting circuits having the largest first number are not equal.
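The selection rules described above (largest first number, first threshold, and disagreement among tied circuits) can be sketched in software as follows. This is only an illustrative model of the multiplexor logic, not the circuit itself; the function name `select_output`, the `(match_count, codeword)` tuple layout, and the use of `None` as the error indicator are assumptions made for the example.

```python
def select_output(candidates, threshold):
    """Pick one error-corrected codeword from several circuits.

    candidates: list of (match_count, corrected_codeword) pairs, one per
                error-detecting circuit (hypothetical layout).
    threshold:  the first threshold, e.g. T / 2.
    Returns the chosen codeword, or None as an error indicator.
    """
    best = max(count for count, _ in candidates)
    if best < threshold:
        return None  # largest first number is below the threshold
    winners = [cw for count, cw in candidates if count == best]
    # Several circuits may tie; they must agree on the corrected codeword.
    if any(cw != winners[0] for cw in winners[1:]):
        return None
    return winners[0]


T = 4
agreeing = [(4, [1, 2, 3]), (4, [1, 2, 3]), (1, [9, 9, 9])]
conflicting = [(3, [1, 2, 3]), (3, [1, 2, 4])]
low_confidence = [(1, [1, 2, 3]), (0, [5, 6, 7])]
```

With `T = 4`, `select_output(agreeing, T / 2)` returns the shared codeword, while the conflicting and low-confidence inputs both yield the error indicator.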

According to another embodiment of the present invention, a system for hardware error-correcting code (ECC) detection or correction of a received codeword from an original codeword is provided. The system includes a plurality of error-detecting circuits each configured to process a different selection of symbols of the received codeword using a different set of factors, some or all remaining symbols of the original codeword being recomputable from a corresponding said selection of symbols of the original codeword using the set of factors. Each of the error-detecting circuits includes a hardware multiplier and accumulator configured to use the set of factors and the selection of symbols of the received codeword to recompute the some or all remaining symbols of the original codeword, and a hardware comparator configured to compare one or more of the recomputed remaining symbols of the original codeword with a corresponding said one or more remaining symbols of the received codeword or the recomputed remaining symbols of the original codeword from another one of the error-detecting circuits, and to output first results of this comparison.

Each of the error-detecting circuits may be further configured to output at least one of the recomputed remaining symbols of the original codeword as error-corrected symbols of the received codeword based on the first results.

The first results may include complete equality of the compared recomputed remaining symbols of the original codeword.

The symbols of the received codeword may be ordered, and each of the error-detecting circuits may be further configured to process a different consecutive selection of symbols of the received codeword.

A number of the error-detecting circuits may equal a number of the symbols of the received codeword.

The error-detecting circuits may be ordered by respective first symbols of their corresponding consecutive selections of symbols of the received codeword, and for each of the error-detecting circuits, the other one of the error-detecting circuits may be a consecutive one of the error-detecting circuits.

According to yet another embodiment of the present invention, a method of hardware error-correcting code (ECC) detection or correction of a received codeword from an original codeword using a hardware error-detecting circuit is provided. The method includes processing by the error-detecting circuit a selection of symbols of the received codeword using a set of factors, the original codeword being recomputable from a corresponding said selection of symbols of the original codeword using the set of factors. The processing includes recomputing remaining symbols of the original codeword by using a hardware multiplier and accumulator on the set of factors and the selection of symbols of the received codeword, comparing by a hardware comparator the recomputed remaining symbols of the original codeword with corresponding said remaining symbols of the received codeword, and outputting first results of this comparison.

The method may further include outputting by the error-detecting circuit the selection of symbols of the received codeword and the recomputed remaining symbols of the original codeword as an error-corrected codeword based on the first results.

The error-detecting circuit may include a plurality of error-detecting circuits, and the processing by the error-detecting circuit of the selection of symbols of the received codeword using the set of factors may include processing by each of the plurality of error-detecting circuits a different said selection of symbols of the received codeword using a different said set of factors.

The method may further include outputting by each of the error-detecting circuits the selection of symbols of the received codeword and the recomputed remaining symbols of the original codeword as an error-corrected codeword based on the first results.

The method may further include outputting by a multiplexor the error-corrected codeword from one of the error-detecting circuits based on the first results from all of the error-detecting circuits.

According to still yet another embodiment of the present invention, a system configured to perform error-correcting code (ECC) detection or correction of a received codeword from an original codeword is provided. The system includes a processor, a non-transitory storage device configured to store computer programming instructions, and a memory. The memory has a set of the instructions stored thereon that, when executed by the processor, causes the processor to process different selections of symbols of the received codeword using corresponding different sets of factors. For each selection of symbols of the different selections of symbols and corresponding set of factors of the different sets of factors, some or all remaining symbols of the original codeword are recomputable from a corresponding said selection of symbols of the original codeword using the set of factors. The processing of the selection of symbols of the received codeword using the set of factors includes recomputing the some or all remaining symbols of the original codeword through multiplication and accumulation using the set of factors and the selection of symbols of the received codeword, and comparing one or more of the recomputed remaining symbols of the original codeword with a corresponding said one or more remaining symbols of the received codeword or the recomputed remaining symbols of the original codeword from the processing of another one of the different selections of symbols of the received codeword. The instructions, when executed by the processor, further cause the processor to output first results of the comparisons.

The instructions, when executed by the processor, may further cause the processor to output at least one of the recomputed remaining symbols of the original codeword as error-corrected symbols of the received codeword based on the first results.

The first results may include complete equality of the compared recomputed remaining symbols of the original codeword.

The symbols of the received codeword may be ordered, and the different selections of symbols of the received codeword may include different consecutive selections of symbols of the received codeword.

A number of the different selections of symbols of the received codeword may equal a number of the symbols of the received codeword.

The different selections of symbols of the received codeword may be ordered by respective first symbols of their corresponding consecutive selections of symbols of the received codeword, and for the processing of the selection of symbols of the received codeword using the set of factors, the processing of the other one of the different selections of symbols of the received codeword includes the processing of a consecutive one of the different selections of symbols of the received codeword.

The recomputing of the some or all remaining symbols of the original codeword through multiplication and accumulation using the set of factors and the selection of symbols of the received codeword may include, for each symbol of the selection of symbols of the received codeword, performing the multiplication of the symbol by each factor in the set of factors in parallel using a pre-built table customized to the set of factors.

According to yet another embodiment of the present invention, a method of hardware error-correcting code (ECC) detection or correction of a received codeword from an original codeword using a plurality of hardware error-detecting circuits is provided. The method includes processing by each of the error-detecting circuits a different selection of symbols of the received codeword using a different set of factors, some or all remaining symbols of the original codeword being recomputable from a corresponding said selection of symbols of the original codeword using the set of factors. The processing includes recomputing the some or all remaining symbols of the original codeword by using a hardware multiplier and accumulator on the set of factors and the selection of symbols of the received codeword, comparing by a hardware comparator one or more of the recomputed remaining symbols of the original codeword with a corresponding said one or more remaining symbols of the received codeword or the recomputed remaining symbols of the original codeword from another one of the error-detecting circuits, and outputting first results of this comparison.

The method may further include outputting by each of the error-detecting circuits at least one of the recomputed remaining symbols of the original codeword as error-corrected symbols of the received codeword based on the first results.

The first results may include complete equality of the compared recomputed remaining symbols of the original codeword.

The symbols of the received codeword may be ordered. The processing by each of the error-detecting circuits may further include processing a different consecutive selection of symbols of the received codeword.

A number of the error-detecting circuits may equal a number of the symbols of the received codeword.

The error-detecting circuits may be ordered by respective first symbols of their corresponding consecutive selections of symbols of the received codeword. For each of the error-detecting circuits, the other one of the error-detecting circuits may be a consecutive one of the error-detecting circuits.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, together with the specification, illustrate example embodiments of the present invention and, together with the description, serve to explain aspects and principles of the present invention.

FIG. 1 is a block diagram of an example hardware error detector and corrector according to an embodiment of the present invention.

FIG. 2 is a block diagram of an example hardware error detector and corrector according to another embodiment of the present invention.

FIG. 3 is a block diagram of an example system for hardware error-correcting code (ECC) detection or correction of a received codeword from an original codeword according to an embodiment of the present invention.

FIG. 4 is a block diagram of an example system for hardware ECC detection or correction of a received codeword from an original codeword according to another embodiment of the present invention.

FIG. 5 is a block diagram of an example system for hardware ECC detection or correction of a received codeword from an original codeword according to yet another embodiment of the present invention.

FIG. 6 is a block diagram of an example system for hardware ECC detection or correction of a received codeword from an original codeword according to still yet another embodiment of the present invention.

FIG. 7 is a block diagram of an example software-based system for ECC detection or correction of a received codeword from an original codeword according to an embodiment of the present invention.

DETAILED DESCRIPTION

Hereinafter, example embodiments of the present invention will be described in more detail with reference to the accompanying drawings. In the drawings, like reference numerals refer to like elements throughout. Herein, the use of the term “may,” when describing embodiments of the present invention, refers to “one or more embodiments of the present invention.” In addition, the use of alternative language, such as “or,” when describing embodiments of the present invention, refers to “one or more embodiments of the present invention” for each corresponding item listed.

The present application relates to Reed-Solomon error correction, described more fully in the Reference Application. Briefly, with Reed-Solomon encoding, data words of K symbols are encoded with an additional T symbols (parity symbols or check symbols) to produce codewords of N=K+T symbols. Each of these codewords has the extraordinary property that the entire N symbol codeword is reconstructable from any K of its symbols (i.e., any K symbols of the codeword can be used to generate the remaining T symbols). While the reconstruction property generally requires knowledge of which K symbols are available, Reed-Solomon error correction is directed to the more general problem of reconstructing a codeword when some number (up to T−1) of its symbols have errors, but the location of the errors is unknown.

As described in the Reference Application, this problem can be attacked by brute force, looking for the largest consistent subset of at least K+1 symbols, and then reconstructing the other symbols from this consistent subset. Here, a consistent subset refers to a subset of the N symbols of the codeword for which no errors can be detected. Subsets of K symbols or fewer are consistent by definition since at least K+1 symbols are needed to check for inconsistency.

This brute force approach may be performed by software routines described in the Reference Application, but in general, their complexity grows exponentially with the size of the codeword. Reed-Solomon encoding/decoding involves Galois Field arithmetic. While Galois Field addition is relatively straightforward, Galois Field multiplication is significantly harder. In general, for N symbol codewords with T parity symbols, it takes K Galois field multiplication operations (hereafter referred to as “multiplys”) to encode each parity symbol, or a total of K×T multiplys for all T parity symbols. Decoding is of similar complexity.
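To make the K×T multiply count concrete, the following is a minimal software sketch of Galois field multiply-accumulate encoding. It assumes byte-sized symbols in GF(2^8) with reduction polynomial 0x11D; the field size, the polynomial, and the names `gf_mul` and `encode_parity` are illustrative assumptions, not details taken from this specification.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two GF(2^8) elements by shift-and-add.

    The reduction polynomial 0x11D is an assumption for this sketch.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a  # Galois field addition is a bitwise XOR
        a <<= 1
        if a & 0x100:
            a ^= poly    # reduce back into 8 bits
        b >>= 1
    return result


def encode_parity(data, factors):
    """Compute T parity symbols from K data symbols.

    factors is a T-by-K table (one row of K factors per parity symbol).
    Returns the parity symbols and the number of multiplys performed.
    """
    multiplys = 0
    parity = []
    for row in factors:
        acc = 0
        for f, d in zip(row, data):
            acc ^= gf_mul(f, d)
            multiplys += 1
        parity.append(acc)
    return parity, multiplys


K, T = 20, 10
data = list(range(1, K + 1))
# Placeholder factors (all ones); a real code uses its own fixed matrix.
parity, multiplys = encode_parity(data, [[1] * K for _ in range(T)])
```

For K=20 and T=10 the loop performs 200 multiplys, which is the K×T total stated above; a hardware implementation would perform them concurrently rather than in a loop.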

In an embodiment of the present invention, this process is sped up through the use of a sliding window list decoder, which is a collection of Parallel Error Correctors (PECs). Each PEC is a hardware circuit, as can be realized in, for example, a field programmable gate array (FPGA), that performs the Galois field multiplication and addition operations in parallel for either the encoding or decoding of a particular combination of K symbols of the N symbol codeword, or a separate processing core that performs the multiplication and addition in parallel with other processing cores. For example, in a PEC, for any K symbols of a codeword, in one step (clock cycle), the K multiplys can be performed in parallel for each of the T other symbols, and the products added together in parallel in the next clock cycle, thus yielding the T other symbols. During the next clock cycle, the computed T symbols can be compared to the received codeword's corresponding T symbols, noting any differences as indicative of error location or possible correction.

In other embodiments, symbols produced by one PEC may be compared to those of another PEC, producing the same result. A collection of these results (mismatch or match, 1 or 0) may identify all possible errors of length less than T. The location of the error symbols and the length of the error symbols may be identified by the location of the ‘1’ bits in the resultant collection.

For ease of description, it will be assumed throughout that the PECs do not generate errors themselves. Further, in a low latency implementation of an embodiment of the present invention, all of the multiply, addition, and comparison operations could be performed on the same clock cycle, saving both time and resources, since no intermediate registers would be required.

For example, using one of the PECs, the K data symbols in the codeword may be used to generate the T parity symbols. This may be useful, for example, for normal parity symbol generation. It may also be useful for comparing the newly-generated parity symbols to the existing codeword. If there are any discrepancies between the newly generated parity symbols and the existing symbols, then an error has been detected. That is, the codeword is not consistent.

FIG. 1 is a block diagram of an example hardware error detector and corrector 100 (e.g., a decoder or PEC) according to an embodiment of the present invention. It should be noted that this is one example of a PEC, performing both error detection and correction. Other embodiments of PECs may do only error detection, or may work in conjunction with other PECs to do error detection or correction.

The PEC 100 is illustrated conceptually in FIG. 1 by the different processing steps, many of which are described in more detail below. Processing starts with a received codeword 110 having N=K+T symbols as input to the PEC 100. The received codeword 110 represents an original codeword of N symbols that, for example, has been transferred (such as between computing devices) or stored (such as on a storage medium), possibly acquiring errors along the way. The PEC breaks this received codeword 110 into two parts, a selection of K symbols 120 (which can be any K symbols) and the remaining T symbols 125. For example, the PEC 100 may be customized to process a particular set of K symbols 120 from a received codeword 110.

The selection of K symbols 120 is paired with a set of factors 130 and input to a matrix multiplier 140. The set of factors 130 corresponds to the selection of K symbols 120 as part of the Reed-Solomon technique. For example, the PEC 100 may be custom designed to select a particular set of K symbols 120 of the received codeword and process these selected K symbols 120 with a corresponding set of factors 130 (which stay fixed for the same selection of K symbols 120 from any N symbol codeword 110 generated with the same Reed-Solomon protocol). The matrix multiplier 140 performs the Galois field multiplication and addition to recompute the remaining T symbols 150 of the original codeword (assuming that there are no errors in the K selected symbols 120).
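One way to picture a fixed set of factors tied to a particular selection of K symbols is the evaluation ("original") view of Reed-Solomon codes, in which each symbol is the value of a degree-less-than-K polynomial at a distinct field point and the factors are Lagrange basis values. The sketch below takes that view as an assumption; the specification does not state how the factors 130 are derived, and the field GF(2^8), the polynomial 0x11D, and all function names here are hypothetical.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply in GF(2^8); the reduction polynomial 0x11D is an assumption."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a):
    """Inverse of a nonzero element: a^254, since a^255 == 1 in GF(2^8)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def factor_matrix(selected_points, remaining_points):
    """Row j holds the K factors that rebuild the symbol at remaining_points[j]."""
    rows = []
    for xj in remaining_points:
        row = []
        for xi in selected_points:
            num, den = 1, 1
            for xm in selected_points:
                if xm != xi:
                    num = gf_mul(num, xj ^ xm)  # subtraction is XOR here
                    den = gf_mul(den, xi ^ xm)
            row.append(gf_mul(num, gf_inv(den)))
        rows.append(row)
    return rows


def recompute(selected_symbols, factors):
    """Multiply-accumulate: one recomputed symbol per row of factors."""
    out = []
    for row in factors:
        acc = 0
        for f, s in zip(row, selected_symbols):
            acc ^= gf_mul(f, s)
        out.append(acc)
    return out


def poly_eval(coeffs, x):
    """Horner evaluation; used here only to build a demo codeword."""
    acc = 0
    for c in coeffs:
        acc = gf_mul(acc, x) ^ c
    return acc


# Demo with K=4, T=2: symbol i is the data polynomial evaluated at point i.
K, T = 4, 2
codeword = [poly_eval([7, 1, 200, 33], x) for x in range(K + T)]
selected, remaining = [1, 2, 3, 4], [5, 0]
FACTORS = factor_matrix(selected, remaining)  # fixed for this selection
rebuilt = recompute([codeword[i] for i in selected], FACTORS)
```

Because `FACTORS` depends only on which positions are selected, not on the data, it can be computed once per selection, which mirrors the observation that the factors stay fixed for the same selection of K symbols.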

The recomputed remaining T symbols 150 of the original codeword are then input to a hardware comparator 160, which compares the recomputed remaining T symbols 150 of the original codeword with the remaining T symbols 125 of the received codeword. For example, a symbol-by-symbol comparison may be performed, producing a corresponding bit vector of 1's and 0's (such as an EXCLUSIVE OR comparison, producing 1 if the corresponding two symbols being compared are different and 0 if they are the same). The output of the comparator 160 may also be thought of as an indicator of correction confidence 175. For example, an output of all 0's may be regarded as complete confidence in the error correction, while an output of all 1's may be regarded as complete lack of confidence in the error correction. In other embodiments, counts (such as counts of the number of 0's) or other indicators may be output as the correction confidence 175 instead of or in addition to the bit vectors. The correction confidence 175 is discussed further below.
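The comparison step can be modeled in a few lines of software. This is a sketch only; the function name `compare_symbols` and the choice to return both the bit vector and the count of 0's are assumptions made for illustration.

```python
def compare_symbols(recomputed, received):
    """Symbol-by-symbol comparison of two equal-length symbol lists.

    Returns (bit_vector, match_count): bit_vector has 1 where the symbols
    differ and 0 where they are the same; match_count is the number of 0's.
    """
    bit_vector = [0 if a == b else 1 for a, b in zip(recomputed, received)]
    return bit_vector, bit_vector.count(0)


bits, matches = compare_symbols([10, 20, 30, 40], [10, 99, 30, 40])
```

Here the second symbol differs, so `bits` is `[0, 1, 0, 0]` and `matches` is 3; an all-zero vector would correspond to complete confidence in the correction.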

As part of the correction process, the PEC 100 may take the K selected symbols 170 (e.g., the same K selected symbols 120) and output them with the corrected T symbols of the received codeword 172 (e.g., the same recomputed T symbols 150 of the original codeword) as the corrected codeword 180 (e.g., N symbols) depending on the correction confidence 175. For example, when any errors are contained in the remaining T symbols 125 of the received codeword, the PEC 100 will recompute the correct T symbols 172 of the original codeword as part of its processing. However, depending on the correction confidence 175, the PEC 100 may or may not output this corrected codeword as the corrected codeword 180.

FIG. 2 is a block diagram of an example hardware error detector and corrector 200 (e.g., a decoder or multi-PEC system) according to another embodiment of the present invention. As with FIG. 1, the decoder 200 in FIG. 2 is illustrated as performing both error detection and correction, but in other embodiments, the decoder may do only error detection.

The decoder 200 includes a plurality of PECs 210 (such as N=K+T such PECs 210). For example, each PEC 210 may be similar to the PEC 100 described in FIG. 1. Here, the N=K+T symbols are numbered 1, 2, . . . , K+T. Each PEC 210 is responsible for receiving a different K consecutive symbols and generating a corresponding remaining T consecutive symbols. For example, the first PEC 210 receives symbols 1, 2, . . . , K, and generates symbols K+1, K+2, . . . , K+T, the second PEC 210 receives symbols 2, 3, . . . , K+1, and generates symbols K+2, K+3, . . . , K+T, 1, and so on, while the last (Nth or (K+T)-th) PEC 210 receives symbols K+T, 1, 2, . . . , K−1, and generates symbols K, K+1, . . . , K+T−1.
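The wrap-around assignment of symbols to PECs can be expressed with modular arithmetic. The helper below is a hypothetical illustration (the name `window_indices` is not from this specification) that uses the same 1-based symbol numbering as the text.

```python
def window_indices(n, k, start):
    """Symbol numbers (1-based) for the PEC whose window begins at `start`.

    Returns (received, generated): the K consecutive symbols the PEC takes
    as input and the T = n - k consecutive symbols it regenerates, with
    numbering that wraps from n back to 1.
    """
    received = [(start - 1 + i) % n + 1 for i in range(k)]
    generated = [(start - 1 + k + i) % n + 1 for i in range(n - k)]
    return received, generated


K, T = 3, 2
N = K + T
windows = [window_indices(N, K, start) for start in range(1, N + 1)]
```

For K=3 and T=2, the first PEC receives symbols 1, 2, 3 and generates 4, 5; the second receives 2, 3, 4 and generates 5, 1; and the last receives 5, 1, 2 and generates 3, 4, matching the pattern described above.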

The PECs 210, in turn, process their corresponding portions of the received codeword, producing outputs such as corrected codewords (or bit vectors) and correction confidences, such as those described above in reference to FIG. 1. These are input to a multiplexor 230 (or multiplexor logic), which produces the desired outputs from the decoder 200. For example, the decoder 200 may output one or more of a corrected codeword 240 (e.g., as described above in FIG. 1), error locations 242 (such as a bit vector of good and bad input symbols), and a general error indicator 244 (such as when the decoder 200 detects inconsistency in the received codeword but is unable to correct it with sufficient confidence).

For example, the decoder 200 may be a special hardware circuit for detecting any error, but correcting only simple errors (such as single symbol errors). For more involved error scenarios (which may be relatively rare), the decoder 200 may pass control to a more thorough error corrector, such as a brute force software error corrector.

For a given Reed-Solomon encoding, such as for K=20 data symbols and T=10 parity symbols, a fixed K×T=20×10 encoding matrix may be used, performing 200 multiplications concurrently in one clock cycle. Because the factors may be constants for a given encoding matrix, the hardware can be built to take advantage of these fixed values, which significantly simplifies the design. The partial products may then be summed in T=10 groups each of K=20 products concurrently to produce the T=10 parity symbols in another clock cycle. In the third clock cycle, the newly computed (regenerated) parity symbols can be compared to those of the existing codeword to determine if there are any errors detected. In another embodiment, all three steps could be performed in a single, longer clock cycle because of the additional latencies required for each step. In both cases, the correct decoding of Reed-Solomon codewords occurs more frequently and at lower latency than with traditional serial decoders.
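The multiply and accumulate steps above can be illustrated in software. The following is a sketch only: it assumes byte-sized symbols in GF(2^8) with the reduction polynomial 0x11D (a common choice, not stated in the text), and the factor values in the usage note are arbitrary placeholders rather than a real Reed-Solomon encoding matrix.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by x^8+x^4+x^3+x^2+1 (0x11D).
    The choice of polynomial is an assumption made for this sketch."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= 0x11D           # reduce back to 8 bits
        b >>= 1
    return result


def regenerate(selected, factors):
    """Regenerate T symbols from K selected symbols.

    factors is given here as T rows of K constants; each output symbol is
    the XOR-sum of K products (the multiply step, then the summing step).
    """
    outputs = []
    for row in factors:
        acc = 0
        for factor, symbol in zip(row, selected):
            acc ^= gf_mul(factor, symbol)
        outputs.append(acc)
    return outputs
```

For example, `regenerate([5, 6, 7], [[1, 1, 1], [1, 2, 3]])` performs K×T=6 multiplications and sums them into T=2 symbols. In hardware, all of the products in the inner loops would be formed concurrently rather than one at a time.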

In general, there are four possible comparison results: (1) all the regenerated parity symbols match the codeword parity symbols, in which case the codeword is consistent and no errors are detected, (2) all the regenerated parity symbols mismatch the codeword parity symbols, in which case the codeword is inconsistent and an indeterminate number of errors are detected, (3) at least half (but not all) of the regenerated parity symbols match the codeword parity symbols, and the remaining symbols mismatch the codeword parity symbols, in which case the codeword is inconsistent, but a (single) closest valid codeword is detected, so the errors are correctable, or (4) less than half (but at least one) of the regenerated parity symbols match the codeword parity symbols, in which case there could be closer valid codewords, so more information is needed before making a determination of what to do. Cases (1), (2), and (3) are straightforward (namely, good codeword, bad and uncorrectable codeword, and bad and correctable codeword, respectively), but case (4) is more nuanced.

It should be noted that the case of greater than T errors is not considered in any of these scenarios. Simply stated, once the number of errors exceeds the number of parity symbols T, it becomes impossible with Reed-Solomon codes to discern codewords with large numbers of errors from valid codewords with no errors or few errors. Likewise, while detecting T errors is possible with T parity symbols, attempting to correct more than T−1 errors is equally impossible to do with any certainty or confidence. Accordingly, any error correcting scenario with Reed-Solomon codes assumes that there are no more than T−1 errors in unknown positions.

Define the confidence C of such a comparison result to be the number of regenerated parity symbols that match the codeword parity symbols. The trivial cases of C=T and C=0 then reduce to cases (1) and (2), respectively, while the correctable case of C≥T/2 is case (3), so assume 0<C<T/2. For 8-bit symbols (e.g., bytes) having 256 possible values, random errors, K=20 data symbols, and T=10 parity symbols, the larger the value of C, the more certain it is that the errors are all in the codeword parity symbols.

If any one of the data symbols is corrupted, it will cause a completely different set of parity symbols to be regenerated (that is, not one will match the original parity symbols). This follows from the Galois field multiplication used to generate the parity symbols, as well as the property that valid codewords are either identical or differ in at least T+1 corresponding symbol positions (i.e., have a Hamming distance of at least T+1). Changing one input symbol thus forces the resulting T parity symbols to be completely different than the original parity symbols. Accordingly, the only parity symbols that will match are those that happen to have an error but that are nonetheless consistent with the corrupted data symbols. Intuitively, for uniformly random symbol values (e.g., errors) showing up in the regenerated parity symbols, this is a one in 256 chance per regenerated parity symbol (for one-byte symbols), which works out to less than one in 25 of happening over all 10 parity symbols, less than one in 1437 of happening with two or more parity symbols, and less than one in 142,709 with three or more parity symbols.

With this in mind, in one or more embodiments, the PEC may also perform error correction with C<T/2. Larger values of T (and hence, C) allow more errors to be corrected (a property of Reed-Solomon codes). In this case, if a relatively large value of C is obtained, such as close to T/2 with a suitably large value of T, the likelihood is that all the errors are contained in the codeword symbols that were regenerated, so the PEC can output the K selected symbols from the codeword plus the regenerated T symbols as the corrected N symbol codeword. However, if one of the K selected symbols from the codeword has an error, the overwhelming likelihood is that the confidence C will be 0 or perhaps 1, and highly unlikely to approach T/2. In this case, the PEC may output an indicator of inconsistency (error detection) without any error correction. In another embodiment, the PEC may output a possible corrected codeword for larger values of C (such as 3 when T=10). By comparing T symbols, all errors (such as all burst errors) in these symbols with size smaller than T will be identified with at least one match.

For example, with K=20 and T=10, C=4 may be sufficient confidence from a single PEC to correct the codeword. After all, if there are errors in the selected symbols for the PEC, the odds that a regenerated symbol will match the original symbol are only one in 256 (for one-byte symbols), so for T=10 total symbols, the probability of reaching C=4 (or more) by coincidentally matching four of the original codeword symbols is less than one in 20,839,805.
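The chance-match figures quoted above can be checked with a short calculation. This sketch assumes each of the T regenerated symbols matches independently with probability 1/256 (a binomial model, which is an assumption consistent with the intuitive argument given above); the function name is illustrative.

```python
from math import comb


def prob_at_least(c, t=10, symbol_values=256):
    """Probability that at least c of t regenerated symbols match by chance,
    assuming independent matches with probability 1/symbol_values each."""
    p = 1.0 / symbol_values
    return sum(comb(t, j) * p**j * (1 - p) ** (t - j) for j in range(c, t + 1))


if __name__ == "__main__":
    for c in (1, 2, 3, 4):
        print(f"C >= {c}: about one in {1 / prob_at_least(c):,.0f}")
```

Under this model the odds lengthen by roughly two orders of magnitude for each additional coincidental match, which is why a modest confidence such as C=4 can already be persuasive for T=10.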

FIG. 3 is a block diagram of an example system 300 for hardware error-correcting code (ECC) detection or correction of a received codeword 310 from an original codeword according to an embodiment of the present invention. The system 300 may be realized, for example, in hardware, such as an FPGA or other custom circuit, using arithmetic logic circuits and memory circuits, as would be apparent to one of ordinary skill.

The system 300 includes an error-detecting circuit (or PEC) 320, which inputs N symbols from a received codeword 310 and outputs comparison results 390. The PEC 320 breaks up the N input symbols 310 into two sets: a set of K selected symbols 330 for use in regenerating the other T (unselected) symbols 370, and a set of T remaining symbols 360 for use in comparing with the T regenerated symbols 370. For simplicity, the codewords are assumed to be a standard length (e.g., N=K+T symbols), with standard symbols and a standard Reed-Solomon encoding matrix. Accordingly, for any fixed set of K selected symbols 330, there exists a unique set of (K×T) factors 350 for regenerating the T unselected symbols 370 from the K selected symbols 330. To this end, the PEC 320 includes a hardware multiplier and accumulator 340 for regenerating the T unselected symbols 370 from the K selected symbols 330 using the factors 350. The PEC 320 further includes a hardware comparator 380 for comparing the T regenerated symbols 370 with the corresponding received remaining symbols 360 and outputting the comparison results 390.

By way of example, the PEC 320 may output (as the comparison results 390) a count of or a T-bit vector identifying which of the T regenerated symbols 370 equal their corresponding received remaining symbols 360. In addition (or instead), the PEC 320 may output (as the comparison results 390) the regenerated symbols 370, or a combination of the received selected symbols 330 and the regenerated symbols 370 (for example, as a regenerated codeword, such as an error-corrected codeword). For example, depending on the symbol-by-symbol comparison, the PEC 320 may output an error-corrected codeword. It should be noted that in other embodiments, the PEC 320 may generate only a subset of the T unselected symbols as the regenerated symbols 370, comparing them to a corresponding subset of the T received remaining symbols 360, and outputting corresponding comparison results 390 of the compared subsets.

While a single PEC implementation is useful, even more power can be harnessed by running multiple PECs concurrently, each on a different K symbols of the codeword. For example, in some embodiments, the codeword symbols may be ordered or otherwise arranged to enable groups of K symbols to be systematically chosen. In particular, groups may be chosen that have a higher propensity for exhibiting errors, such as consecutive symbols in case of an error burst (several contiguous symbols of noise). For example, the N symbols may be arranged sequentially, and N groups of K consecutive symbols selected, each group starting at a different symbol position and continuing sequentially (with wraparound) for K total symbols. Thus, consecutive groups overlap in K−1 symbols. Each of these groups may then be assigned to a separate PEC, each of which has a custom K×T encoding matrix (used to produce the remaining T symbols). This takes maximal advantage of the Reed-Solomon erasure code property, namely that any K symbols of an N symbol codeword can be used to (re-)generate the remaining T symbols.

There may be other groups of K symbols having this higher propensity to exhibit errors. Accordingly, in different embodiments, different selections of K symbols (such as different systematic selections) may be chosen. For example, consecutive even-numbered or consecutive odd-numbered symbols may be more likely to exhibit error bursts. Consequently, in addition to, or in place of, the sequential selection of K symbols discussed above, selections of consecutive even-numbered symbols or consecutive odd-numbered symbols may be chosen (together with other symbols if needed to reach a total of K symbols). Fewer groups may also be chosen, such as N/2 groups of K consecutive symbols, each group starting at a different even-numbered symbol position.

In addition, in one or more embodiments, a fourth clock cycle (or an additional latency) is added to the multiplication, addition, and comparing cycles described above for a single PEC. In the fourth clock cycle, the outputs of the PECs are compared between themselves, and a selection is made of what information to output. For example, in one embodiment, the most likely error-corrected codeword is output (when possible), while in another embodiment, the most likely error symbols are output (depending, for instance, on what values the follow-on circuitry is expecting). In still another embodiment, a list of possible error-corrected codewords is output, perhaps with confidence levels for each codeword (and perhaps a sorted list by confidence level).

While three basic output conditions are identified above for a single PEC, the outputs of the N PECs described above cover many more cases. In a non-limiting embodiment, assume without loss of generality that after three clock cycles (or three sets of latency), each PEC outputs two pieces of information: the corrected codeword and the confidence value C. There are numerous scenarios of what the group of N PECs may output.

For example, the possibilities include: (a) all PECs output C=T, in which case no errors are detected; (b) all PECs output C=0, in which case an indeterminate number of errors exists, and no error correction is possible; (c) two or more (say P) PECs output a maximum confidence C=C_(max), and all P PECs with confidence C=C_(max) generate the same corrected codeword, in which case the same corrected codeword is output as the most likely correct original codeword, this time with two confidence values, namely first order confidence C_(max) and second order confidence P; (d) only one PEC outputs a maximum confidence C_(max), but C_(max) is sufficiently large (say, C_(max)≥T/2 or almost as large as T/2) that it is certain or overwhelmingly likely that this PEC regenerated the closest valid codeword, which is then output as the corrected codeword with a first order confidence of C_(max) and a nominal second order confidence of 1; and (e) none of the above, such as P>1 but no unanimity of the corrected codeword among these P PECs, or P=1 but C_(max) is not sufficiently large, such as C_(max)<T/2 (or not quite as large as T/2), in which case only a general error indication is output.
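Cases (a) through (e) can be expressed as a small selection routine. This is a sketch under stated assumptions: each PEC result is modeled as a `(codeword, C)` pair with the codeword held as a tuple, the threshold for case (d) defaults to T/2, and the names `group_decision` and the returned labels are hypothetical.

```python
def group_decision(pec_outputs, t, threshold=None):
    """Select a group output from per-PEC (corrected_codeword, C) pairs.

    Returns (label, codeword_or_None, first_order_confidence, second_order).
    """
    if threshold is None:
        threshold = t / 2                       # assumed cutoff for case (d)
    confidences = [c for _, c in pec_outputs]
    if all(c == t for c in confidences):        # case (a)
        return ("no_errors", pec_outputs[0][0], t, len(pec_outputs))
    if all(c == 0 for c in confidences):        # case (b)
        return ("uncorrectable", None, 0, 0)
    c_max = max(confidences)
    best = [cw for cw, c in pec_outputs if c == c_max]
    p = len(best)
    if p >= 2 and all(cw == best[0] for cw in best):   # case (c)
        return ("corrected", best[0], c_max, p)
    if p == 1 and c_max >= threshold:                  # case (d)
        return ("corrected", best[0], c_max, 1)
    return ("general_error", None, c_max, p)           # case (e)
```

For example, two PECs agreeing on the same codeword with C=4 (T=10) fall under case (c) and yield first order confidence 4 and second order confidence 2.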

Cases (a) and (b) are trivial, and similar to the one PEC output cases (1) and (2), respectively. Case (e) is similar to case (b), just being decided by a threshold confidence value versus the trivial no confidence scenario. Case (d) is similar to the single PEC output case (3) described above. Case (c), though, is the more interesting case. The first order confidence value C_(max) is inversely related to the number of corrected symbols F, namely F=T−C_(max). While low values of C_(max) thus indicate a large number of errors, this seemingly low confidence indication is partially offset by the fact that multiple PECs generated the same corrected codeword despite using different encoding matrices. For large first order confidence values C_(max), or large second order confidence values P, this is unlikely to happen unless the error symbols are all located in the symbol positions being regenerated by each of the P PECs.

That is, the greater the value of C_(max) or the greater the value of P, the greater the confidence. First order confidence C_(max), however, provides a greater indication of confidence than second order confidence P since there is a greater likelihood for multiple PECs to generate the same incorrect codeword when C_(max) is low whereas a high value of C_(max) almost always results from correctly finding the closest valid codeword regardless of the value of P. From another viewpoint, C_(max)=1 is almost always going to be seen as a low confidence answer regardless of the value of P whereas C_(max) close to T/2 is almost always going to be seen as a high confidence answer even if P=1. Accordingly, in some embodiments, C_(max)=1 may be treated as a general error indicator while C_(max) close to T/2 (or higher) may be treated as a correct regenerated codeword.

The above set of cases is just an example. In other embodiments, different cases may be used. For example, the requirement of unanimity may be too extreme for large values of F. Accordingly, in some embodiments, case (c) may be replaced with (c′): two or more (say P) PECs output a maximum confidence C=C_(max) (say, at least 2), and a single group (say, of size P_(max) PECs) of these P PECs with confidence C=C_(max) generate the same corrected codeword, in which case the same corrected codeword is output as the most likely correct original codeword, this time with two confidence values, namely first order confidence C_(max) and second order confidence P_(max). Here, the other P−P_(max) PECs likely stumbled on the same (low) confidence C_(max) by chance, but the multiple P_(max) PECs provide sufficient (second order) confidence to believe that this group regenerated the correct original codeword.
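One way to read case (c′) in software is as a vote among the PECs that share the maximum confidence. The sketch below is an interpretation, not the described circuit: it assumes "a single group" means one uniquely largest set of at least two agreeing PECs, treats ties between groups as no decision, and uses the hypothetical name `largest_agreeing_group`.

```python
from collections import Counter


def largest_agreeing_group(pec_outputs, min_confidence=2):
    """Apply a (c')-style rule to per-PEC (corrected_codeword, C) pairs.

    Returns (codeword, C_max, P_max), or None when no single agreeing group
    of two or more maximum-confidence PECs exists.
    """
    c_max = max(c for _, c in pec_outputs)
    if c_max < min_confidence:
        return None
    votes = Counter(cw for cw, c in pec_outputs if c == c_max)
    ranked = votes.most_common(2)
    codeword, p_max = ranked[0]
    # Assumption: a tie for the largest group is treated as no decision.
    tied = len(ranked) > 1 and ranked[1][1] == p_max
    if p_max < 2 or tied:
        return None
    return codeword, c_max, p_max
```

A caller could fall back to a general error indication whenever `None` is returned.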

FIG. 4 is a block diagram of an example system 400 for hardware ECC detection or correction of a received codeword 410 from an original codeword according to another embodiment of the present invention.

The system 400 includes a plurality of PECs 420, 440, . . . , 460, each processing a different selection of K symbols from the N input symbols 410 using a corresponding different set of factors and generating a different set of T remaining or unselected symbols (or subset thereof), and generating potentially different respective comparison results 430, 450, . . . , 470. For example, each of the PECs 420, 440, . . . , 460 may be a separate instance of PEC 320 in FIG. 3.

In addition, the system 400 includes a group comparison circuit 480, such as a multiplexer circuit (MUX), for outputting group comparison results 490. For example, the group comparison circuit 480 may receive the error corrected codeword from each of the PECs 420, 440, . . . , 460, and select the codeword from the PEC having the largest confidence (such as the PEC generating the greatest number of symbols that match the corresponding received codeword symbols). In another example, if all of the PECs 420, 440, . . . , 460 return small confidence values, the group comparison circuit 480 may return a general error indicator as the group comparison results 490, or a best guess as to the correct codeword, or some other result such as those described above.

Increasing the symbol size (entry size), say from eight bits to 16 (e.g., one byte to two bytes), or increasing the number of parity symbols T, say doubling, significantly reduces the likelihood of deciding among seemingly low confidence alternatives. For instance, 16-bit symbols or entries are highly unlikely to compare equal by chance (e.g., the one in 256 chance discussed above becomes one in 65,536). Likewise, increasing T increases the number of symbols that the PEC can check, which increases the chance that the PEC (and more importantly, multiple PECs) generate the same corrected codeword while also eliminating competition from other low confidence PECs that come up with different corrected codewords.

Another Approach

In the above multi-PEC approach, the comparisons are done in a manner similar to the single PEC approach, namely comparing the results of the regenerated T symbols to the corresponding T symbols of the codeword. Another approach with multiple PECs is to compare the regenerated codewords to the regenerated codewords of other PECs. For example, if the N symbols of the codeword are arranged in a cyclical order (such as 1, 2, . . . , N, and then back to 1 again), then N separate PECs can process regenerating the T consecutive symbols starting at each of the N different symbol positions of the codeword. This creates N separate codewords. The regenerated codewords can then be compared to each other.

As the regenerated codewords from the PECs are assumed to be valid (or consistent) codewords (since the PECs are assumed to not introduce any new errors in their calculations, and generated or regenerated codewords of N symbols from K initial symbols are by definition consistent), the regenerated codewords are either pairwise equal or differ in at least T+1 symbols. This is a property of Reed-Solomon encoding: the (Hamming) distance between different consistent codewords is at least one more than the number of parity symbols (T), which is T+1 symbols in this case. However, adjacent PECs (i.e., those pairs of PECs assigned to the K symbols at each of consecutive symbol positions of the codeword) by construction share the same K−1 initial symbols, leaving only a maximum of T+1 symbols by which they can differ. Accordingly, the adjacent PECs either generate the same remaining T+1 symbols or they generate completely symbolwise distinct sets of the remaining T+1 symbols. As such, it suffices to compare only a single symbol of these T+1 symbols between each pair of adjacent PECs.

Accordingly, using this technique, up to T−1 errors are correctable in the T−1 shared parity symbols of the two adjacent PECs. By extension, a burst of up to T−1 errors in any T−1 consecutive symbols of the codeword is correctable by testing for equality in any one of the T+1 nonoverlapping symbols of the corresponding adjacent PECs. It should be noted that the confidence of such a correction (e.g., T−1 errors) may be relatively low, but that can be addressed in other ways, such as using larger symbol sizes, invoking other techniques (such as brute force, Welch-Berlekamp, etc.) in such circumstances, outputting lists of possible regenerated codewords or error locations (possibly with confidence levels as well), examining outputs of other PECs and comparisons between PECs, etc.

FIG. 5 is a block diagram of an example system 500 for hardware ECC detection or correction of a received codeword 310 from an original codeword according to yet another embodiment of the present invention.

The system 500 of FIG. 5 has similar features and reference numerals to that of FIG. 3, so for briefness of description, discussion will focus primarily on the differences. In FIG. 5, the PEC 520 has a similar organization to the PEC 320 of FIG. 3. The most significant difference is that the PEC 520 receives two different input streams for comparison symbols, namely the N symbols from the received codeword 310, and some number (such as between 1 and T) of regenerated symbols 510 from another PEC, such as an adjacent (or consecutive) PEC. The remaining symbols logic 560 may, for example, decide which symbols to use for comparison, or the selection may be built into the hardware (e.g., choosing one symbol from the received codeword and T−1 symbols from the regenerated symbols 510 from another PEC). In other embodiments, the PEC 520 may receive regenerated symbols 510 from two or more other PECs, such as both adjacent PECs in a consecutive ordering of the PECs.

The comparison results 590 of the PEC 520 may be modified as appropriate from the comparison results 390 of FIG. 3 to account for the differences between the embodiments, such as which symbols take part in the comparisons. In addition, the regenerated symbols 570 in PEC 520 are modified from the regenerated symbols 370 in PEC 320 to also output some or all of the regenerated symbols 570 to another (or more than one) PEC 580.

FIG. 6 is a block diagram of an example system 600 for hardware ECC detection or correction of a received codeword 410 from an original codeword according to still yet another embodiment of the present invention.

The system 600 has similar features to that of FIG. 4, so for briefness of description, discussion will focus primarily on the differences. Each of the PECs 620, 640, . . . , 660 may be a separate instance of PEC 520 in FIG. 5. The PECs 620, 640, . . . , 660 share their regenerated symbols as discussed above in FIG. 5. For example, the first PEC 620 may receive the regenerated symbols 610 from the last PEC 660, and the first PEC 620 may in turn send its regenerated symbols to the second PEC 640. This arrangement may then repeat through all the PECs 620, 640, . . . , 660, so that each PEC receives the regenerated symbols from its immediately preceding PEC.

As a non-limiting example, there may be N PECs altogether, with PEC 620 receiving and processing symbols 1, 2, . . . , K from the received codeword 410, PEC 640 receiving and processing symbols 2, 3, . . . , K+1, and so on, with PEC 660 receiving and processing symbols K+T, 1, 2, . . . , K−1, as described above in reference to FIG. 2. Each PEC may then output the T−1 regenerated symbols it shares in common with its adjacent PEC to which it is sending regenerated symbols (for the adjacent PEC to use in its comparisons with its own regenerated symbols).

The corresponding comparison results 630, 650, . . . , 670 may then contain results of the comparisons with the regenerated symbols of the adjacent PECs (such as bit vectors, corrected codewords, etc.), which may then be output to a group comparison circuit 680 (such as a multiplexer MUX), which may output the final comparison results 690 of the system 600.

Implementation Considerations

While the above description covers some of the theoretical aspects of parallel error correctors for error correcting codes, what follows are some implementation considerations for building such parallel error correctors.

Of the T+1 nonoverlapping codeword symbols in each pair of adjacent PECs, one symbol does not require recomputation in the first PEC and another symbol does not require recomputation in the second PEC. Let B₁B₂ . . . B_(N) be the N symbol supplied codeword (with symbols B₁, B₂, . . . , B_(N)) being subjected to error correction. Assume PEC P₁ provided with initial symbols B₁B₂ . . . B_(K) and PEC P₂ provided with initial symbols B₂B₃ . . . B_(K+1) are the two adjacent PECs to be tested for error burst correction. Then they overlap in the K−1 initial symbols B₂B₃ . . . B_(K) and they share regenerating the T−1 parity symbols B_(K+2)B_(K+3) . . . B_(K+T), with PEC P₁ further regenerating symbol B_(K+1) and PEC P₂ further regenerating symbol B₁.

Accordingly, from the discussion above, it suffices to regenerate only a single codeword symbol from each of the PECs. For example, each PEC P_(i) can regenerate the codeword symbol B_(i−1) to compare with the supplied codeword symbol B_(i−1) and if the two are the same, then, depending on the similar comparisons of the other PECs, an error burst of up to T−1 symbols among the T−1 shared symbols being regenerated by PECs P_(i−1) and P_(i) is detected and can be corrected. This is equivalent to generating an inconsistency bit string S of N bits, with each bit S_(i) in the inconsistency string S being 1 if supplied symbol B_(i) is not consistent with the K succeeding symbols B_(i+1), B_(i+2), . . . , B_(i+K), and being 0 if it is consistent. Each PEC P_(i) then computes S_(i−1), so the inconsistency string S can be computed in parallel among the N PECs, each bit corresponding to a different reconstruction of the original symbols from the received symbols. Each reconstruction corresponds to a different consecutive K symbols.

The resulting N-bit string of bits S that represents matches (0) and mismatches (1) indicates the location of possible errors (for example, burst errors) of size less than T. For example, a single error (such as in received symbol B₁) generates S₁=1, S₂=S₃= . . . =S_(T)=0, and S_(T+1)=S_(T+2)= . . . =S_(K+T)=1 (that is, T−1 consecutive 0's and K+1 consecutive 1's), with burst errors of size T−1 symbols or less being guaranteed to generate at least one 0 bit in the inconsistency string S (since they contain at least one set of K+1 consecutive symbols that are consistent).
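The inconsistency string can be demonstrated with a toy code. To keep the arithmetic short, this sketch substitutes a prime field GF(257) and an evaluation-style Reed-Solomon code (codeword symbol j is the value at x=j of the polynomial through the K data points) for the GF(2^8) encoding matrix of the text; both substitutions, and all function names, are assumptions made for illustration. Positions are 0-based in the code, so position 0 plays the role of B₁.

```python
P = 257  # prime modulus; stands in for GF(2^8) in this sketch


def interpolate_at(xs, ys, x):
    """Value at x of the unique polynomial of degree < len(xs) through
    the points (xs[j], ys[j]), using Lagrange interpolation mod P."""
    total = 0
    for j, xj in enumerate(xs):
        num = den = 1
        for m, xm in enumerate(xs):
            if m != j:
                num = num * (x - xm) % P
                den = den * (xj - xm) % P
        total = (total + ys[j] * num * pow(den, P - 2, P)) % P
    return total


def encode(data, n):
    """Systematic toy codeword: the first len(data) symbols are the data."""
    xs = list(range(len(data)))
    return [interpolate_at(xs, data, x) for x in range(n)]


def inconsistency_string(received, k):
    """S[i] = 1 if symbol i disagrees with the value regenerated from the
    k succeeding symbols (with wraparound), else 0."""
    n = len(received)
    bits = []
    for i in range(n):
        xs = [(i + 1 + d) % n for d in range(k)]
        ys = [received[x] for x in xs]
        bits.append(int(interpolate_at(xs, ys, i) != received[i]))
    return bits
```

With K=20, T=10 and a single corrupted symbol at position 0, `inconsistency_string` yields one 1, then T−1=9 zeros, then K=20 further 1's, which is the pattern described above (the K+1 ones are consecutive when the string is read cyclically).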

In general, if the sliding window successfully corrects a burst error of length L (less than or equal to T−1) and “weight” W (for example, L consecutive symbols, the first and last having errors and a total of W symbols having errors) then there will be T−L+1 consecutive reconstructions that agree and so T−L consecutive 0's in the inconsistency string S. If it is desired to know how long a burst error is, then it is not necessary to reconstruct all T symbols of each reconstruction to compare with the received symbols; the system can just compare 1 symbol for each PEC, as discussed above. The number of symbols in which the reconstruction agrees with the received codeword will be T−W. If the correctly reconstructed symbols are desired, then for each string of 0's in the results (inconsistency string S), the system could recompute all T symbols for any one of the corresponding consecutive reconstructions, and compare to the received symbols.
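Taking the stated relation at face value (a burst of length L leaves T−L consecutive 0's in S), the burst length can be read off the longest cyclic run of 0's. The helper names below are hypothetical, and the treatment of the all-0 and no-0 strings is an assumption of this sketch.

```python
def longest_zero_run(bits):
    """Length of the longest run of 0's in a cyclic bit string."""
    n = len(bits)
    if 1 not in bits:
        return n
    best = run = 0
    for b in bits + bits:          # doubling the list handles wraparound
        run = run + 1 if b == 0 else 0
        best = max(best, run)
    return best


def burst_length(bits, t):
    """Estimate burst length L = T - (longest run of 0's).

    Returns 0 for an all-0 string (no inconsistency) and None when there
    is no 0 at all (nothing the sliding window can correct).
    """
    zeros = longest_zero_run(bits)
    if zeros == len(bits):
        return 0
    if zeros == 0:
        return None
    return t - zeros
```

For the single-error pattern with T=10 (nine consecutive 0's), this gives L=1.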

For reconstructing symbols from a window of size T, it is possible to precompute a set of factors that allows the recomputation of the T symbols from the K received symbols outside the window (e.g., as discussed in the Reference Application). However, in some cases it may be more efficient to instead use a different set of factors that allow the T symbols to be recomputed using the unerased parity symbols (that is, those of the last T symbols of the received codeword that are outside the window), along with the “partial parity” of the unerased data symbols (that is, those of the first K symbols of the received codeword that are outside the window). The partial parity is the result of the encoding matrix applied to the data vector (that is, the first K symbols of the received codeword), but with rows and columns corresponding to symbols inside the window removed. One feature of this method is that when reconstructing different windows but using the same received codeword, the overall parity can first be computed and then the parity of the window can simply be subtracted. If T is smaller than K this can lead to improved efficiency.

Other Uses

While the above description covers some of the embodiments of the present invention, aspects of the present invention can be applied to different applications, some of which will be described here.

For example, the above-described ECC decoder may be usefully combined with existing ECC decoders to preserve and then improve the performance of existing systems and standards. One practical implementation would be a software only change to an existing hardware ECC decoder system. For example, before error recovery is abandoned, in one embodiment, a software-based approach of the above techniques could be tried in an attempt to recover the data. Another embodiment could be a hardware combination of the old approach with the new approach, with an intermediate circuit to usefully combine both results.

As an example of such an embodiment, a Berlekamp decoder could be augmented with a sliding window decoder. If the Berlekamp decoder failed to decode (so any error must have weight greater than T/2), then the sliding window decoder could be tried, providing a chance to correct burst errors of weight between T/2 and T−1. In another embodiment, the sliding window decoder could be used to make Berlekamp more conservative. If Berlekamp identifies a codeword within T/2 of the received codeword but the sliding window also sees that it is within T−1 of some other codeword via a burst error, then the combined decoder could declare decoder failure (e.g., decide that the Berlekamp decoder has stumbled upon a silent data corruption of more than T/2 errors but fewer than T that is being mistakenly reported as fewer than T/2 errors), or log the fact that this decoding is perhaps suspect.

Another practical embodiment would be to use a hardware-based sliding window decoder first and then use a Berlekamp decoder if the sliding window failed. The low latency of the sliding window approach could speed up error correction on average.

For a software-based approach, a processor (such as a microprocessor) configured to execute a set of instructions (such as machine instructions) could be used. The instructions may be stored on a nonvolatile storage device, such as a disk drive or Flash memory. The processor may have a memory for accessing the instructions and input data, and for storing output data. The processor may have one or more processing cores that can perform the processing on different threads in parallel. In general, the above-described hardware approach may be simulated in software as would be apparent to one of ordinary skill.

Conceptually, the software version replaces the multipliers and adders discussed above with the memory and the processor capable of performing Galois Field arithmetic. For example, in place of the multiplication and addition circuits discussed above, corresponding locations in the memory could be used, with the processor performing the multiplication and addition, storing the results in the corresponding locations in the memory, as would be apparent to one of ordinary skill.

By way of example, in one embodiment, this software approach may be performed serially by a single processor. That is, the above-described fully parallelized hardware solution with dedicated hardware (separate multipliers and adders) is replaced with a serial machine with a shared multiplier and adder (the processor) working together with the memory to perform the same Galois Field arithmetic, comparisons, and outputting.

This software solution may be accelerated with a “parallel multiplier.” Using the above example of codewords having 30 symbols, namely 20 data symbols and 10 parity symbols, each software “PEC” recomputes 10 missing symbols from 20 received symbols. This involves 200 separate multiplications, namely 10 separate multiplications for each of the 20 received symbols. Each received symbol is thus multiplied by 10 separate factors as part of its share of the 200 multiplications. These multiplications can be carried out in parallel using SSE instructions in a manner similar to that described in the Reference Application, only here, the supplied symbol stays the same, and this symbol is multiplied by a set of 10 factors. Since the 10 factors are known in advance, and are used to process this symbol position in all such codewords, a lookup table can be built in advance to do the 10 separate multiplications in parallel.

Accordingly, all such lookup tables can be pre-built for each of the 20 symbol positions in each of the 30 software “PECs,” which will significantly speed up the Galois Field multiplication. Depending on factors such as the symbol size, the data processing size of the processor, and the number of symbols T being regenerated, the processor may be able to perform all of the multiplications for a single symbol position of a single “PEC” in a handful of instructions. By way of example, the Reference Application discusses a method of multiplying 64 byte-sized symbols at a time by a single symbol using the SSE architecture. Here, the technique is similar, only it is the single symbol of the received codeword being multiplied concurrently by the different set of factors needed to regenerate the T missing symbols. For example, with byte-sized symbols, a similar implementation could be used to process values of T up to 64 in the same short parallel multiplier loop.
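The pre-built tables can be illustrated with a plain-Python analogy, in which one table lookup per received symbol yields all T products at once, packed into a single integer. This is only a sketch of the table idea, not the SSE implementation of the Reference Application; the function names, the `factors[j][i]` layout, and the polynomial 0x11D are assumptions.

```python
from typing import List, Sequence


def gf_mul(a: int, b: int, poly: int = 0x11D) -> int:
    """Multiply two bytes in GF(2^8) (assumed reduction polynomial)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def build_tables(factors: Sequence[Sequence[int]]) -> List[List[int]]:
    """Pre-build one 256-entry table per selected symbol position.

    tables[i][s] packs the T products factors[j][i] * s (j = 0..T-1)
    into one integer, most significant byte first.
    """
    num_outputs = len(factors)
    num_inputs = len(factors[0])
    tables = []
    for i in range(num_inputs):
        column = [factors[j][i] for j in range(num_outputs)]
        tables.append([
            int.from_bytes(bytes(gf_mul(f, s) for f in column), "big")
            for s in range(256)
        ])
    return tables


def recompute_with_tables(tables: Sequence[Sequence[int]],
                          selected: Sequence[int],
                          num_outputs: int) -> List[int]:
    """Recompute all T symbols with one lookup and one XOR per input symbol."""
    acc = 0
    for table, symbol in zip(tables, selected):
        acc ^= table[symbol]   # XOR of packed bytes adds all T lanes at once
    return list(acc.to_bytes(num_outputs, "big"))
```

In the 20-data, 10-parity example, this replaces the 200 individual multiplications per software “PEC” with 20 lookups and 20 wide XORs, at the cost of storing 20 tables of 256 packed entries for each “PEC.”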

FIG. 7 is a block diagram of an example software-based system 700 for ECC detection or correction of a received codeword from an original codeword according to an embodiment of the present invention.

The system 700 uses a computer 710 (such as a general purpose computer, personal computer, customized computer, etc.) that includes a processor 720 and memory 730 for carrying out the Reed-Solomon encoding and decoding operations on supplied codewords. The system 700 also includes a non-transitory storage device 740, such as a disk drive, for storing data and program instructions used by the processor 720 to perform this task. For example, the memory 730 may load software instructions from the storage device 740 that, when executed by the processor 720, cause the processor to perform any of the above-described hardware ECC detection or correction implementations. Codewords may be supplied, for example, from the storage device 740 (or supplied over a network, such as a local area network or wide area network). The processor may perform the Reed-Solomon encoding and decoding operations on the supplied codewords and output the results to, for example, the storage device 740 or attached networks.

The software-based system 700 may conceptually process more combinations of selected symbols from the supplied codewords than is practical to implement in hardware. For example, the system 700 may be employed when a codeword has been detected by a hardware-based system (such as one described above) to be corrupted, but whose correction is not possible by the hardware-based system (or at least not with sufficient confidence of the result). To this end, the system 700 may store all of the factors needed to do the Reed-Solomon encoding and decoding operations on the storage device 740, loading those sets of factors (into the memory 730) needed for the particular encoding or decoding to be performed. Further optimizations, such as the parallel multiplier discussed above, may be employed by the processor 720 if, for example, the processor 720 supports the features (e.g., SSE architecture) needed for a particular optimization. In some embodiments, the processor 720 may be customized to perform the Reed-Solomon encoding and decoding operations more efficiently.

The software-based system may be applied to existing or future storage devices, for example RAID systems using a maximum distance separable (MDS) erasure coding scheme, such as Reed-Solomon ECC with at least 2 parity drives. In one embodiment, a collection of T-bit registers is associated with non-overlapping windows of T symbols, covering all N symbols. The values of each of the windows are reconstructed, using the symbols outside the window, and then compared to the unused symbols inside the window to create the values of the corresponding register. If the registers are all 1's except one register, which shows all 0's except one 1, then the system declares that the symbol corresponding to that 1 is in error.

If there have been no symbol errors then all values of all registers will show 0 (agreement). If there is exactly one symbol in error then the registers corresponding to all windows not covering the error will show all 1's (one-symbol errors will cause all the reconstructed symbols to be in error because of the structure of MDS codes). The reconstruction corresponding to the window covering the error will be correct, thus will only disagree with the received word at the error. Thus, in the embodiment above, the system will be able to correct all errors of up to one symbol, with no chance of failure.
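The register decision rule described above can be sketched as follows. The sketch assumes that the registers have already been filled by the reconstruct-and-compare step, that N is a multiple of T so the windows tile the codeword exactly, and that T is at least 2; the function name and the status strings are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def locate_single_error(
    registers: Sequence[Sequence[int]],
) -> Tuple[str, Optional[int]]:
    """Interpret per-window T-bit registers (1 = disagreement).

    registers[w][i] refers to symbol w*T + i of the codeword.
    Returns (status, symbol_index), where symbol_index is set only
    when exactly one symbol is identified as being in error.
    """
    window_size = len(registers[0])
    if window_size < 2:
        raise ValueError("this sketch assumes T >= 2")

    # No disagreement anywhere: the received codeword is consistent.
    if all(not any(reg) for reg in registers):
        return "no_error", None

    single = [w for w, reg in enumerate(registers) if sum(reg) == 1]
    full = [w for w, reg in enumerate(registers) if all(reg)]

    # Exactly one window shows a lone 1 and every other window shows
    # all 1's: the lone 1 marks the symbol in error.
    if len(single) == 1 and len(full) == len(registers) - 1:
        w = single[0]
        return "single_error", w * window_size + list(registers[w]).index(1)

    # Any other pattern is outside the one-symbol guarantee.
    return "uncorrectable", None
```

For example, with T=2 and N=6, the registers `[[1, 1], [0, 1], [1, 1]]` point to symbol 3, which can then be replaced by the value reconstructed for the second window.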

The software-based embodiment above provides a different level of data protection that can be added to existing and future data storage systems: not just recovering from a known disk failure, but also recovering from silent data corruption. By using more windows, thus more registers and more computation, the system could also provide protection against errors larger than one symbol, although perhaps only with a level of certainty, not the absolute guarantee as for one symbol.

The list approach and the resulting bit vector, as described in embodiments above, reveal important information regarding the performance of the storage or data transmission systems, which is not usually exposed using Welch-Berlekamp. This information could be described as error correction “meta-data,” and could be used in many useful fashions. For example, in one embodiment, a communication system could increase its transmission rate until the meta-data revealed too much stress, such as decreasing confidence levels as described above. In other embodiments, this type of feedback loop could implement what performance enthusiasts call “over-clocking,” but then could auto-adjust to the application environment by using the meta-data. In another embodiment, for a computation involving a very large data set, this meta-data, accumulated over time, could provide a confidence level that the overall computational result was based on trustworthy data.

Glossary of Some Variables

C confidence of error correction (number of regenerated symbols = codeword symbols)
F number of corrected symbols
K number of data symbols
N number of codeword symbols = K+T
T number of parity symbols
L length of a burst error
W weight of a burst error
S inconsistency bit vector

While the above description contains many specific embodiments of the present invention, these should not be construed as limitations on the scope of the present invention, but rather as examples of specific embodiments thereof. Accordingly, the scope of the present invention should be determined not by the embodiments illustrated, but by the appended claims and their equivalents.

What is claimed is:
 1. A system for hardware error-correcting code (ECC) detection or correction of a received codeword from an original codeword, the system comprising: an error-detecting circuit configured to process a selection of symbols of the received codeword using a set of factors, the original codeword being recomputable from a corresponding said selection of symbols of the original codeword using the set of factors, the error-detecting circuit comprising: a hardware multiplier and accumulator configured to use the set of factors and the selection of symbols of the received codeword to recompute remaining symbols of the original codeword; and a hardware comparator configured to compare the recomputed remaining symbols of the original codeword with corresponding said remaining symbols of the received codeword and to output first results of this comparison.